Collocations as Word Co-occurrence Restriction Data - An Application to Japanese Word Processor

Authors

  • Kosho Shudo
  • Masahito Takahashi
  • Yasuo Koyama
  • Kenji Yoshimura
Abstract

Collocations, that is, combinations of specific words, are quite useful linguistic resources for NLP in general. The purpose of this paper is to show their usefulness through an application to the Kanji character decision process in Japanese word processors. Unlike recent attempts at automatic extraction, our collocations were collected manually through many years of intensive corpus investigation. Our collection procedure consists of (1) finding a proper combination of words in a corpus and (2) recalling similar combinations of words prompted by it. This procedure, which depends on human judgment and on enriching the data by association, is effective in remedying the data sparseness problem, although some arbitrariness of human judgment is inevitable. Approximately 72,400 collocations were used as word co-occurrence restriction data for deciding Kanji characters in Japanese word processing. Experiments have shown that the collocation data yield a Kana-to-Kanji character conversion accuracy 8.9% higher than that of a system which uses no collocation data and 7.0% higher than that of a commercial word processor of average performance.
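
The idea of using collocations as co-occurrence restrictions for choosing among homophonic Kanji candidates can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy dictionaries `CANDIDATES` and `COLLOCATIONS`, the function name `convert`, and the scoring rule (count adjacent word pairs found in the collocation set) are hypothetical and are not taken from the paper, whose actual data format and decision procedure are not described in this abstract.

```python
from itertools import product

# Hypothetical toy dictionary: each Kana reading maps to its homophonic
# candidates, listed in a default (collocation-free) preference order.
CANDIDATES = {
    "あつい": ["暑い", "熱い", "厚い"],  # hot (weather) / hot (object) / thick
    "ほん": ["本"],                      # book
    "おちゃ": ["お茶"],                  # tea
    "なつ": ["夏"],                      # summer
}

# Hypothetical collocation entries: ordered word pairs that go together.
COLLOCATIONS = {
    ("厚い", "本"),    # thick book
    ("熱い", "お茶"),  # hot tea
    ("暑い", "夏"),    # hot summer
}


def convert(readings):
    """Choose one candidate per reading so that the number of adjacent
    word pairs found in COLLOCATIONS is as large as possible.

    Unknown readings are passed through unchanged. On ties, the first
    combination in default candidate order wins, which is also what a
    system without collocation data would output.
    """
    options = [CANDIDATES.get(r, [r]) for r in readings]

    def score(words):
        # Count adjacent pairs that are licensed by the collocation data.
        return sum((a, b) in COLLOCATIONS for a, b in zip(words, words[1:]))

    return list(max(product(*options), key=score))


if __name__ == "__main__":
    print(convert(["あつい", "ほん"]))
    print(convert(["あつい", "おちゃ"]))
```

In this sketch the same reading あつい is resolved differently depending on its neighbor, which is the effect the abstract attributes to collocation data. The exhaustive search over all combinations is only practical for short inputs; it is used here for clarity rather than efficiency.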

Related resources

Extracting Bilingual Collocations from Non-Aligned Parallel Corpora

This paper proposes a new method to find correspondences of uninterrupted collocations from Japanese-English bilingual corpora without sentence-to-sentence alignment. Uninterrupted collocations in English such as “once again”, “give up”, or “gross national product” handled as a single word or a compound word in Japanese, can be automatically extracted with corresponding Japanese words using wor...

Large Scale Collocation Data and Their Application to Japanese Word Processor Technology

Word processors or computers used in Japan employ Japanese input method through keyboard stroke combined with Kana (phonetic) character to Kanji (ideographic, Chinese) character conversion technology. The key factor of Kana-to-Kanji conversion technology is how to raise the accuracy of the conversion through the homophone processing, since we have so many homophonic Kanjis. In this paper, we re...

Intellectual structure of knowledge in Nanomedicine field (2009 to 2018): A Co-Word Analysis

Introduction: The Co-word analysis has the ability to identify the intellectual structure of knowledge in a research domain and reveal its subsurface research aspects. Objective: This study examines the intellectual structure of knowledge in the field of nanomedicine during the period of 2009 to 2018 by using Co-word analysis. Materials and Methods: This paper develops a sciento...

Survey of Word Co-occurrence Measures for Collocation Detection

This paper presents a detailed survey of word co-occurrence measures used in natural language processing. Word co-occurrence information is vital for accurate computational text treatment; it is important to distinguish words which can combine freely with other words from other words whose preferences to generate phrases are restricted. The latter words together with their typical co-occurring ...

Integrating Morphology With Multi-Word Expression Processing In Turkish

This paper describes a multi-word expression processor for preprocessing Turkish text for various language engineering applications. In addition to the fairly standard set of lexicalized collocations and multi-word expressions such as named-entities, Turkish uses a quite wide range of semi-lexicalized and non-lexicalized collocations. After an overview of relevant aspects of Turkish, we present...

Publication date: 2000